Abstract:

In current high-speed digital signal-processing (DSP) architectures, the 
Residue Number System (RNS) has an important role to play. RNS 
implementations have a highly modular structure, and are not dependent 
upon large binary arithmetic elements. RNS implementations become more 
attractive when combined with the advantages offered by VLSI fabrication 
technology. In this paper, a novel design methodology has been developed for 
RNS structures, based on using look-up tables, which takes into consideration 
the unique features and requirements of RNS. The paper discusses the 
following three phases: 1) developing a look-up table layout model, which is 
used to derive relationships between the size of each modulus and both chip 
area and time; this model supports all types of moduli; 2) selecting the most 
efficient layout according to the design requirements; the procedure allows 
the designer to control the area, time, or the configuration of the memory 
module required for implementing a modulo look-up table; 3) proposing a set 
of multi-look-up table modules, to be used as building block units for 
implementing digital signal-processing architectures. The paper uses two 
examples to illustrate the use of the modules in phase 3).
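
As a rough illustration of the look-up-table approach described above, the
following Python sketch builds one modulo table per modulus and estimates its
memory size. The moduli (7, 11, 13), the function names, and the size estimate
(2^(2b) words of b bits for a b-bit modulus) are assumptions made for
illustration; they are not the layout model developed in the paper.

```python
import operator

# Hypothetical, pairwise coprime moduli chosen only for this example.
MODULI = (7, 11, 13)

def build_lut(m, op):
    """Return an m x m table holding op(a, b) mod m for all residues a, b."""
    return [[op(a, b) % m for b in range(m)] for a in range(m)]

# One table per modulus and per operation; each channel is independent.
ADD_LUT = {m: build_lut(m, operator.add) for m in MODULI}
MUL_LUT = {m: build_lut(m, operator.mul) for m in MODULI}

def to_rns(x):
    """Convert a non-negative integer to its residue tuple."""
    return tuple(x % m for m in MODULI)

def rns_op(luts, x, y):
    """Apply a table-based operation channel by channel, with no carries."""
    return tuple(luts[m][a][b] for m, a, b in zip(MODULI, x, y))

def lut_size_bits(m):
    """Assumed size model: two b-bit operands address 2^(2b) words of b bits."""
    b = (m - 1).bit_length()
    return (1 << (2 * b)) * b
```

Under these assumptions, `rns_op(MUL_LUT, to_rns(25), to_rns(30))` gives the
same residues as `to_rns(750)`, and `lut_size_bits` shows how quickly the table
grows with the modulus width, which is the kind of area relationship a layout
model has to capture.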
Abstract:

Moving routing tables from RAM to custom or semicustom VLSI 
can lower cost and boost performance. The routing table problem 
is presented by discussing the available architectures and how 
they are related. It is shown that simple table lookup is just a 
special case of the standard trie structure and that the use of 
partitioning combined with the trie structure provides a continuum 
that can lead to a CAM implementation at one extreme. The 
high-level tradeoffs in the choice of various parameters for the trie 
are estimated. A careful choice of word size can balance the 
requirements for speed with the costs of area. Also considered are 
the costs and benefits of splitting the table into a number of tries, 
which are searched simultaneously. VLSI implementations are 
outlined, and the costs are compared. General CAM structures are 
not needed for the routing table application, and custom CAMs 
can be very efficient. Tries, however, can be competitive in many 
cases, due to the resources available for building conventional 
memories.
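
The relationship between simple table lookup and the trie can be sketched in
Python as follows. This is a software analogy only: the function names, the
fixed-length exact-match keys, and the use of dictionaries in place of memory
words are assumptions, not the VLSI structures compared in the paper. The
`stride` parameter plays the role of the trie word size.

```python
def build_trie(routes, key_bits, stride):
    """Build a trie that consumes `stride` bits of the key per level.

    routes maps fixed-length integer keys to output ports.
    """
    assert key_bits % stride == 0, "stride must divide the key length"
    mask = (1 << stride) - 1
    root = {}
    for key, port in routes.items():
        node = root
        # Walk or create the internal levels, most significant digit first.
        for shift in range(key_bits - stride, 0, -stride):
            node = node.setdefault((key >> shift) & mask, {})
        # The last digit indexes the leaf entry that stores the port.
        node[key & mask] = port
    return root

def lookup(root, key, key_bits, stride):
    """Return the port for `key`, or None; one memory access per level."""
    mask = (1 << stride) - 1
    node = root
    for shift in range(key_bits - stride, -1, -stride):
        node = node.get((key >> shift) & mask)
        if node is None:
            return None
    return node
```

With `stride == key_bits` the trie has a single level, so the lookup is one
direct table access. Smaller strides trade more sequential accesses
(`key_bits // stride` of them) for smaller tables at each level.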
Abstract:

Residue number representation has become a viable alternative 
for fast, area-efficient VLSI realization of high-performance signal 
processing hardware. Wider applicability and improved 
cost/performance of residue-based VLSI implementations of signal 
processing algorithms are critically dependent on efficient 
realization of I/O data conversions and several other support 
functions that are treated in this paper. Alternate table-lookup 
schemes for conversion of binary to residue numbers, and vice 
versa, are presented. Input widths of lookup tables can be 
changed freely through a repartitioning scheme to provide 
tradeoffs between table size (area) and computation speed. 
Improved variants of VLSI-based pipelined binary-to-residue 
converters are derived along with balanced, highly regular, 
pipelined architectures for residue-to-binary conversion in VLSI. 
The input repartitioning method is shown to be applicable to other 
important residue number system operations, including sign 
detection, mixed-radix conversion, and base extension.
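
The trade-off between table size and speed in binary-to-residue conversion can
be illustrated with a generic partitioned-lookup sketch in Python. It is not
the paper's converter: the function names and the chunk width `w` are
assumptions, and the partial results are added sequentially here rather than in
a pipeline.

```python
def build_tables(m, n_bits, w):
    """One table per w-bit chunk of the input.

    Table i maps a chunk value c to (c * 2^(i*w)) mod m.
    """
    n_chunks = -(-n_bits // w)  # ceiling division
    return [[(c << (i * w)) % m for c in range(1 << w)]
            for i in range(n_chunks)]

def binary_to_residue(x, m, tables, w):
    """Compute x mod m from per-chunk lookups followed by modular additions."""
    mask = (1 << w) - 1
    acc = 0
    for i, table in enumerate(tables):
        acc = (acc + table[(x >> (i * w)) & mask]) % m
    return acc

def total_entries(n_bits, w):
    """Total table entries for a given partitioning (assumed cost measure)."""
    return -(-n_bits // w) * (1 << w)
```

Repartitioning the input amounts to changing `w`: wider chunks mean fewer
tables and additions but exponentially larger tables, and narrower chunks mean
the opposite.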
Abstract:

The field programmable gate-array (FPGA) has become an 
important technology in VLSI ASIC designs. In the past few years, 
a number of heuristic algorithms have been proposed for 
technology mapping in lookup-table (LUT) based FPGA designs, 
but none of them guarantees optimal solutions for general 
Boolean networks and little is known about how far their solutions 
are away from the optimal ones. This paper presents a theoretical 
breakthrough which shows that the LUT-based FPGA technology 
mapping problem for depth minimization can be solved optimally 
in polynomial time. A key step in our algorithm is to compute a 
minimum height K-feasible cut in a network, which is solved 
optimally in polynomial time based on network flow computation. 
Our algorithm also effectively minimizes the number of LUT's by 
maximizing the volume of each cut and by several 
post-processing operations. Based on these results, we have 
implemented an LUT-based FPGA mapping package called 
FlowMap. We have tested FlowMap on a large set of benchmark 
examples and compared it with other LUT-based FPGA mapping 
algorithms for delay optimization, including Chortle-d, 
MIS-pga-delay, and DAG-Map. FlowMap reduces the LUT network 
depth by up to 7% and reduces the number of LUT's by up to 50%
compared to the three previous methods.
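
To make the notion of a K-feasible cut and a depth label concrete, here is a
small Python sketch. Note that it does not use FlowMap's network-flow
computation: it substitutes exhaustive cut enumeration, which gives the same
minimum depth on small networks but can take exponential time. The function
name, the network encoding, and the assumption that every gate has at most K
fanins are illustrative choices.

```python
from itertools import product

def optimal_depth_labels(fanins, order, k):
    """Minimum LUT depth of every node, found by enumerating K-feasible cuts.

    fanins: dict mapping each node to a tuple of its fanin nodes
            (an empty tuple marks a primary input).
    order:  all nodes in topological order.
    Assumes every gate has at most k fanins (a K-bounded network).
    """
    cuts, label = {}, {}
    for v in order:
        if not fanins[v]:
            cuts[v] = {frozenset([v])}
            label[v] = 0  # primary inputs sit at depth 0
            continue
        merged = set()
        # Combine one cut from each fanin; keep unions with at most k nodes.
        for combo in product(*(cuts[u] for u in fanins[v])):
            cut = frozenset().union(*combo)
            if len(cut) <= k:
                merged.add(cut)
        # A LUT rooted at v sits one level above the deepest node of its cut.
        label[v] = 1 + min(max(label[u] for u in cut) for cut in merged)
        # The trivial cut {v} lets fanouts treat v as a LUT output.
        cuts[v] = merged | {frozenset([v])}
    return label
```

For a three-gate tree over four inputs, the root can be covered by a single
LUT when K is 4, but needs two levels when K is 2 or 3.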
Abstract:

To meet its performance goal of executing most instructions in a single clock,
the i486 microprocessor uses a cache memory that is integrated on the same
silicon die as the processor. The integrated cache is capable of performing one
memory read or write each clock cycle. The size and organization of the cache 
were selected based on available silicon area and on the results of 
trace-driven simulation. An external bus was devised featuring burst data 
transfers to quickly fill cache lines and provisions to ensure cache coherency.
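
Trace-driven simulation of the kind mentioned above can be sketched in a few
lines of Python. The simulator below is a generic set-associative cache with
true LRU replacement. Its name, its parameters, and the replacement policy are
assumptions for illustration and are not taken from the i486 design.

```python
from collections import OrderedDict

def simulate(trace, size_bytes, line_bytes, ways):
    """Return the hit ratio of a set-associative LRU cache on an address trace.

    trace is a non-empty sequence of byte addresses.
    """
    n_sets = size_bytes // (line_bytes * ways)
    sets = [OrderedDict() for _ in range(n_sets)]
    hits = 0
    for addr in trace:
        line = addr // line_bytes      # which cache line the byte belongs to
        cache_set = sets[line % n_sets]
        tag = line // n_sets
        if tag in cache_set:
            hits += 1
            cache_set.move_to_end(tag)  # mark as most recently used
        else:
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)  # evict least recently used
            cache_set[tag] = True
    return hits / len(trace)
```

Running the same trace through several combinations of `size_bytes`,
`line_bytes`, and `ways` is one simple way to compare cache organizations
before committing silicon area to one of them.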